-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: IO support for R data files with C extension #41386
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ParfaitG is this a different format than pyarrow reads/writes? (on the r side) sorry it obviously is, what i mean is, are there utilities / code in the pyarrow r project to read write this format?
doc/source/user_guide/io.rst
Outdated
.. ipython:: python | ||
|
||
rda_file = os.path.join(file_path, "env_data_dfs.rda") | ||
env_dfs = pd.read_rdata(rda_file, select_frames=["sea_ice_df"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah for read_hdf
we do this by forcing the user to have a key. are these always an ordered list? or a keyed list?
@jreback - These binary formats (RData, rda, rds) are part of |
What is the status here? I see no further response. I can rebase but this ENH may not make it into v1.3. I fully understand that this PR with new C extension is pretty involved and we want to be conservative on large enhancements in |
@ParfaitG have been busy but will take a look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments on setup.
setup.py
Outdated
if name == "io.rdata._rdata" and is_platform_mac(): | ||
# non-conda builds must adjust paths to libiconv .h and lib dirs | ||
include = [ | ||
os.path.join(os.environ["CONDA_PREFIX"], "include"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you check if CONDA_PREFIX is defined? If it isn't, presumably this isn't conda.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this catch. Turns out this if
block is not needed. I removed it.
setup.py
Outdated
@@ -364,6 +370,12 @@ def run(self): | |||
# https://github.com/pandas-dev/pandas/issues/35559 | |||
extra_compile_args.append("-Wno-error=unreachable-code") | |||
|
|||
# rdata requires system iconv library | |||
os.environ["DYLD_LIBRARY_PATH"] = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overriding the users DYLC_LIBRARY_PATH feels like the wrong solution. Why is this needed? Does this change this path for the duration of the console session, even outside of the setup run?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is not needed and has been removed.
@@ -364,6 +370,12 @@ def run(self): | |||
# https://github.com/pandas-dev/pandas/issues/35559 | |||
extra_compile_args.append("-Wno-error=unreachable-code") | |||
|
|||
# rdata requires system iconv library | |||
os.environ["DYLD_LIBRARY_PATH"] = "" | |||
rdata_includes = ["/usr/include"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably these should be no on Windows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. This section is under the if is_platform_mac()
condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to push all of the c-code to a separate package? why is this a good (or not good solution in this case)? i am concerned that this is just creating even bigger wheels and more maintenance for all users.
If you recall, we explored a Python package for this IO module to read R data files but it had a restrictive license. See previous PR in my first checked item at top. That package was a wrapper to this same C library that we directly interfaced with here, avoiding another soft dependency for pandas. Regarding size, this librdata proclaims itself as a small, lightweight C library. Its source files with two added scripts for Windows totals 177 KB (slightly larger than |
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
pandas.read_rdata
andDataFrame.to_rdata
#40884Proposed RData I/O module interfaces to the C library: librdata.
Overall, this PR includes following changes:
DataFrame.to_rdata
)Note: special handling of
iconv
, a system resource built-in to Unix machines, is required:iconv.h
header of the GNU C library is included./usr/include
and/usr/lib
, which may differ from users installs.iconv
is not built-in, two counterpart files (.h and .c script) were added from this repo, win-iconv, where its readme indicates code is placed in the public domain.